10-315 Intro to Machine Learning (SCS Majors)
Lecture 3: Decision Trees - Overfitting

Leila Wehbe
Carnegie Mellon University
Machine Learning Department


Lecture based on material from Tom Mitchell's lecture 2, Nina Balcan's lecture 2, and on Kilian Weinberger's lecture 17

Lecture outcomes

  • What is a decision tree
  • How to use information gain as a heuristic for building a short tree
  • Notion of overfitting

Links (use the version you need)

  • Notebook

  • PDF slides

Supervised Learning Problem Statement

The goal is to learn a function $c^*$ that maps input variables $X$ to output variables $y$, based on a set of labeled training examples.

  • classification: $y$ is binary or multiclass
  • regression: $y$ is continuous

Training Data: Given a training set of $n$ labeled examples $\{(X^{(1)}, y^{(1)}), (X^{(2)}, y^{(2)}), \dots, (X^{(n)}, y^{(n)})\}$, where $X^{(i)} \in \mathcal{X}$ represents the input features and $y^{(i)} \in \mathcal{Y}$ represents the corresponding labels, the goal is to estimate the optimal function $c^*$ that best predicts the labels for new, unseen data.

Hypothesis Space: The function $c^*$ is chosen from a family of hypotheses $ \mathcal{H} $. That is, $ c^* \in \mathcal{H} $, where $ \mathcal{H}$ represents the set of all possible functions that could map inputs to outputs.

Learning Rule: A learning rule is applied to select the optimal function $c^* $ from the hypothesis space $\mathcal{H}$. The learning rule is typically defined based on an optimization algorithm that seeks to minimize a cost function over the training data.

Function approximation

Problem setting:

  • Set of possible instances $X$
  • Unknown target function $f:X\rightarrow Y$
  • Set of candidate hypotheses $\mathcal{H}=\{h \mid h:X\rightarrow Y\}$

Input:

  • Training examples $\langle X^{(i)},y^{(i)} \rangle$ of unknown target function $f$

Output:

  • Hypothesis $c^* \in \mathcal{H}$ that best approximates target function $f$

Decision trees

Learn concept PlayTennis (i.e., decide whether our friend will play tennis on a given day)

image.png

PLAY TENNIS?

  • A Decision tree for $f$: (Outlook, Temperature, Humidity, Wind) $\rightarrow$ PlayTennis? image-2.png
  • Each internal node: test one discrete-valued attribute $X_d$
  • Each branch from a node: selects one value for $X_d$
  • Each leaf node: predict $Y$ (or $P(Y | X \in \text{leaf})$)

Example: x = (Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=High), h(x) = Yes

Function approximation

Problem setting:

  • Set of possible instances $X$
    • example: Outlook, Temperature, Humidity, Wind
  • Unknown target function $f:X\rightarrow Y$
    • example: Y is binary (play or not play)
  • Set of candidate hypotheses $\mathcal{H}=\{h \mid h:X\rightarrow Y\}$
    • example: each $h$ is a decision tree

Decision tree example

  • Suppose $X = (X_1,X_2,… X_n)$ where $X_i$ are boolean-valued variables

  • How would you represent $Y = X_2\land X_5$? $\quad$ $Y = X_2\lor X_5$?

image.png

How would you represent $X_2X_5\lor X_3X_4(\neg X_1)$?

Example with training data

Draw a tree that represents:

| $X_1$ | $X_2$ | $X_3$ | $Y$ |
|-------|-------|-------|-----|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |

  • The order in which we pick the attributes affects the size (and efficiency) of the tree.
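
The table above is exactly $Y = X_1 \lor X_2$ ($X_3$ is irrelevant), so a two-test tree suffices. As a sketch, the tree written as nested conditionals:

```python
def h(x1, x2, x3):
    """One decision tree consistent with the table: test X1 at the root,
    then X2; X3 never needs to be tested since Y = X1 OR X2."""
    if x1 == 1:
        return 1          # leaf: predict 1
    if x2 == 1:
        return 1          # leaf: predict 1
    return 0              # leaf: predict 0

# verify against all eight rows of the table
table = [(0, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 1), (0, 1, 1, 1),
         (1, 0, 0, 1), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1)]
assert all(h(x1, x2, x3) == y for x1, x2, x3, y in table)
```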

How to choose the best hypothesis?

  • How to automatically find a good hypothesis for training data?

    • A core algorithmic question.
  • When do we generalize and do well on unseen data?

    • A learning theory question.
    • Occam’s razor: use the simplest hypothesis consistent with data!
  • Occam’s razor: Fewer short hypotheses than long ones

    • a short hypothesis that fits the data is less likely to be a statistical coincidence
    • highly probable that a sufficiently complex hypothesis will fit the data
    • (remember this is a heuristic. read here for more discussion)

How to choose the best hypothesis?

  • How to automatically find a good hypothesis for training data?

    • A core algorithmic question.
  • When do we generalize and do well on unseen data?

    • A learning theory question.
    • Occam’s razor: use the simplest hypothesis consistent with data!
  • Other core questions in machine learning:

    • How do we choose a hypothesis space?
      • Often we use prior knowledge to guide this choice
    • How to model applications as machine learning problems?
      • engineering challenge

How to choose the best Decision Tree?

  • How to pick the smallest tree?

    • NP-hard [Hyafil-Rivest’76]!
  • Luckily, we have very nice practical heuristics and top-down algorithms (e.g., ID3) that can achieve a good solution

Top-Down Induction of Decision Trees [examples: ID3, C4.5; Quinlan]

  • ID3: A natural greedy approach to growing a decision tree top-down (from the root to the leaves, by repeatedly replacing an existing leaf with an internal node).

  • Algorithm main loop:

    • Pick the "best" attribute $A$ to split on at a node, based on the training data.
    • Assign $A$ to this node.
    • For each value of $A$, create a new descendant.
    • Split the training examples among the descendants.
    • If the training examples at a new node are perfectly classified, stop; else recurse on that node.
  • How do we know which attribute is best?
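
The main loop above can be sketched in Python. This is a minimal, unoptimized version (the dict-based rows, the `(attribute, {value: subtree})` tree representation, and the `id3`/`predict` names are assumptions for illustration); it uses information gain, defined below, to pick the "best" attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Sample entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, attrs):
    """Grow a tree top-down. Each row is a dict with a 'label' key.
    A tree is either a label (leaf) or an (attribute, {value: subtree}) node."""
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1 or not attrs:      # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                                # information gain of splitting on a
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [r["label"] for r in rows if r[a] == v]
            g -= len(sub) / len(rows) * entropy(sub)
        return g

    best = max(attrs, key=gain)                 # pick the "best" attribute
    rest = [a for a in attrs if a != best]
    return (best, {v: id3([r for r in rows if r[best] == v], rest)
                   for v in set(r[best] for r in rows)})

def predict(tree, x):
    while isinstance(tree, tuple):              # walk down to a leaf
        attr, children = tree
        tree = children[x[attr]]
    return tree

# toy check: learn Y = X1 OR X2
rows = [{"x1": a, "x2": b, "label": int(a or b)} for a in (0, 1) for b in (0, 1)]
tree = id3(rows, ["x1", "x2"])
```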

Which attribute is best?

  • ID3 uses a statistical measure called information gain (how well a given attribute separates the training examples according to the target classification)

  • Information Gain of variable $A$ is the expected reduction in entropy of target variable $Y$ for data sample $S$, due to sorting on variable $A$

$$\text{Gain}(S,A) = H_S(Y) - H_S(Y|A)$$

  • Entropy ($H_S$) is an information-theoretic measure that characterizes the complexity of a labeled set $S$.

Entropy

  • Sample Entropy of a Labeled Dataset
    • $S$ is a sample of training examples
      • $p_1$ is the proportion of positive examples in $S$
      • $p_0$ is the proportion of negative examples in $S$

  • Entropy measures the complexity of $S$: $$H_S = - p_1 \log(p_1) - p_0 \log(p_0)$$ image-5.png

Entropy

$$H_S = - p_1 \log(p_1) - p_0 \log(p_0)$$ image-5.png

  • E.g., if all examples are negative, then entropy = 0. If all are positive, then entropy = 0.
  • If 50/50 positive and negative, then entropy = 1.
  • If 14 examples with 9 positive and 5 negative, then entropy = 0.940

If labels are not Boolean

$$H_S = - \sum_{k=1}^c p_k \log_2(p_k)$$

E.g., if there are $c$ classes, all equally likely, then $p_k = \frac{1}{c}$ for all $k$ and $H_S = -\log_2\frac{1}{c} = \log_2 c$ (the maximum possible entropy).
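
Both the Boolean and the multi-class formulas can be computed with a few lines of Python (a sketch; `entropy` here takes a list of labels and handles any number of classes):

```python
import math

def entropy(labels):
    """Sample entropy H_S of a list of class labels, in bits (log base 2)."""
    n = len(labels)
    counts = {}
    for y in labels:                      # count occurrences of each class
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# the slide's example: 14 examples, 9 positive and 5 negative
s = [1] * 9 + [0] * 5
print(round(entropy(s), 3))  # 0.94
```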

Another way to understand maximizing information gain

  • We want to be as far as possible from all classes being equally likely (call this the uniform distribution: $q_k = \frac{1}{c}$ for all $k$). This can be done by finding the distribution $(p_1, \dots, p_c)$ that maximizes the KL-Divergence (Kullback-Leibler divergence): \begin{align*} \text{KL}(p \parallel q) &= \sum_{k=1}^c p_k \log \frac{p_k}{q_k} \end{align*}
  • The KL-Divergence is not a metric because it is not symmetric, i.e., $\text{KL}(p||q)\ne \text{KL}(q||p)$.

\begin{align*}
\text{KL}(p \parallel q) &= \sum_{k=1}^c p_k \log \frac{p_k}{q_k} \geq 0 \quad \leftarrow \text{KL-Divergence} \\
&= \sum_k p_k \log(p_k) - p_k \log(q_k) \quad \text{where } q_k = \frac{1}{c} \\
&= \sum_k p_k \log(p_k) + p_k \log(c) \\
&= \sum_k p_k \log(p_k) + \log(c) \quad \text{since } \log(c) \text{ is constant and } \sum_k p_k = 1
\end{align*}

Since $\log(c)$ does not depend on $p$:

$$\max_p \text{KL}(p \parallel q) = \max_p \sum_k p_k \log(p_k) = \max_p \big(-H(p)\big)$$

so maximizing the KL-divergence from the uniform distribution is the same as minimizing the entropy $H(p) = -\sum_k p_k \log(p_k)$.

Information gain of a split

  • Information Gain of variable $A$ is the expected reduction in entropy of target variable $Y$ for data sample $S$, due to sorting on variable $A$ (which takes $c$ values):

$$\text{Gain}(S,A) = H_S(Y) - H_S(Y|A)$$

$$\text{Gain}(S,A) = \text{Entropy}(S) - \sum_{k=1}^c \frac{|S_k|}{|S|} \text{Entropy}(S_k)$$

  • Gain(S,A) is the information provided about the target function, given the value of some other attribute A.
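
As a sketch, the formula applied to the Outlook attribute of the classic PlayTennis table (the rows below follow Mitchell's example data, and the `gain` helper is illustrative code, not part of ID3 proper):

```python
import math

def entropy(labels):
    n = len(labels)
    ps = [labels.count(v) / n for v in set(labels)]
    return -sum(p * math.log2(p) for p in ps)

def gain(rows, attr_idx, label_idx=-1):
    """Gain(S, A) = Entropy(S) - sum_k |S_k|/|S| * Entropy(S_k)."""
    labels = [r[label_idx] for r in rows]
    total = entropy(labels)
    for value in set(r[attr_idx] for r in rows):
        subset = [r[label_idx] for r in rows if r[attr_idx] == value]
        total -= len(subset) / len(rows) * entropy(subset)
    return total

# (Outlook, PlayTennis) pairs from the 14-example PlayTennis table
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Rain", "Yes"), ("Rain", "Yes"), ("Rain", "No"),
        ("Overcast", "Yes"), ("Sunny", "No"), ("Sunny", "Yes"),
        ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rain", "No")]

print(round(gain(data, 0), 3))  # 0.247
```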

Learn concept PlayTennis (i.e., decide whether our friend will play tennis on a given day) image-2.png

image-2.png

image-2.png

image.png

image-3.png

Final decision tree

image.png

Properties of ID3

image.png

  • ID3 performs heuristic search through space of decision trees.
  • It stops at the smallest acceptable tree (Occam's razor).
  • Still a greedy approach: it might not find the shortest tree.

ID3 might still overfit!

  • Overfitting could occur because of noisy data and because ID3 is not guaranteed to output a small hypothesis even if one exists.

  • Consider adding a noisy example:

    • Sunny, Hot, Normal, Strong, PlayTennis = No
  • The tree we learned would no longer be consistent with the training data image-7.png

Overfitting

  • Consider a hypothesis $h$ and its
    • Error rate over the training data: $error\_train(h)$
    • True error rate over all data: $error\_true(h)$
  • We say $h$ overfits the training data if $error\_true(h) > error\_train(h)$
    • We typically don't know $error\_true(h)$, but we can estimate it with $error\_test(h)$ on a held-out set drawn from the same distribution as the training data.
  • Amount of overfitting = $error\_true(h) - error\_train(h)$

Example of overfitting in ID3

Task: learning which medical patients have a form of diabetes. image.png

How can we avoid overfitting?

  • Stop growing when the data split is not statistically significant
  • Grow the full tree, then prune it
  • Example: Reduced Error Pruning
    • Split data into a training set and a validation set
    • Train a tree to classify the training set as well as possible
    • Do until further pruning is harmful:
      1. For each internal tree node, consider making it a leaf node (pruning the tree below it)
      2. Greedily choose the pruning step that best improves error over the validation set
    • Produces the smallest version of the most accurate pruned tree
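
The idea can be sketched in code. This is a simplified bottom-up variant, not the exact greedy loop above (the `(attribute, {value: subtree})` tree representation and the `prune` name are assumptions): children are pruned first, then a node is replaced by its majority-label leaf whenever that does not hurt validation error.

```python
from collections import Counter

def predict(tree, x):
    while isinstance(tree, tuple):        # internal node: (attribute, {value: subtree})
        attr, children = tree
        tree = children[x[attr]]
    return tree                           # leaf: a label

def n_errors(tree, rows):
    return sum(predict(tree, r) != r["label"] for r in rows)

def prune(tree, train_rows, val_rows):
    """Bottom-up reduced-error-pruning sketch: prune subtrees first, then
    replace this node by a majority leaf if validation error does not worsen."""
    if not isinstance(tree, tuple):
        return tree
    attr, children = tree
    for v in children:
        tr = [r for r in train_rows if r[attr] == v]
        va = [r for r in val_rows if r[attr] == v]
        children[v] = prune(children[v], tr, va)
    leaf = Counter(r["label"] for r in train_rows).most_common(1)[0][0]
    if val_rows and n_errors(leaf, val_rows) <= n_errors(tree, val_rows):
        return leaf
    return tree

# toy check: the x2 split below is redundant (label depends only on x1);
# for brevity the same rows stand in for both training and validation data
train = [{"x1": 0, "x2": 0, "label": 0}, {"x1": 0, "x2": 1, "label": 0},
         {"x1": 1, "x2": 0, "label": 1}, {"x1": 1, "x2": 1, "label": 1}]
overfit = ("x1", {0: 0, 1: ("x2", {0: 1, 1: 1})})
pruned = prune(overfit, train, train)
```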

Effect of reduced error pruning

image-2.png

NOTE: the test set should NEVER be used for training. A separate validation set is used for making decisions about pruning.

What if my attributes $X$ are real valued?

  • Use a decision stump: for each attribute, consider splitting above or below a threshold
    • e.g., (is Temperature $\ge$ 72?)

image.png
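
A sketch of the threshold search for one real-valued attribute: scan the midpoints between sorted distinct values and keep the split with the highest information gain (the temperatures and labels below are made up for illustration, and `best_threshold` is a hypothetical helper):

```python
import math

def entropy(labels):
    n = len(labels)
    ps = [labels.count(v) / n for v in set(labels)]
    return -sum(p * math.log2(p) for p in ps)

def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split value < t vs >= t."""
    pairs = list(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2                       # candidate midpoint threshold
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        g = base - (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        if g > best[1]:
            best = (t, g)
    return best

# hypothetical temperatures and play/no-play labels
temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 85]
plays = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(best_threshold(temps, plays))  # perfectly separable at 70.5
```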

Ensemble learning, bagging

  • Using ensemble learning with trees makes them work very well in practice
    • instead of one tree, create a forest of trees and combine their predictions
  • Bagging: resample the training dataset with replacement, train a tree on each resample, and average (or vote over) the trees
  • Random forests: bagging plus a random subset of the features considered at each split
  • We will see ensemble learning later in the course
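
A minimal sketch of the two ingredients, bootstrap resampling and majority voting (the three "stumps" below are hypothetical stand-ins for trained trees; `bootstrap` and `bagged_predict` are illustrative names):

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Resample the training set with replacement (same size as the original)."""
    return [rng.choice(rows) for _ in rows]

def bagged_predict(tree_predictors, x):
    """Combine an ensemble by majority vote over its members' predictions."""
    votes = Counter(p(x) for p in tree_predictors)
    return votes.most_common(1)[0][0]

# toy demo: three hypothetical stumps acting as the "forest"
stumps = [lambda x: x > 3, lambda x: x > 5, lambda x: x > 4]
print(bagged_predict(stumps, 4.5))  # True (two of three stumps vote True)
```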